sceptre part 7

In this writeup I examine the following:
How does sample size vary across datasets?
How does the “exact” version of SCEPTRE (i.e., the version of SCEPTRE that returns an “exact” p-value based on B = 300,000 resamples) perform on the Frangieh IFN-gamma and Papalexi gene data?
A question that I seek to answer is whether SCEPTRE exhibits CAMP, or “Confounder Adjustment via Marginal Permutations” (previously known as “implicit confounder adjustment”).
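As background for the "exact" terminology, the sketch below is a simplified stand-in, not the sceptre implementation: the test statistic is a plain difference in means and the resampling ignores covariates. It computes a permutation p-value from B resamples with the standard +1 finite-sample correction:

```r
# Hedged sketch of an "exact" resampling-based p-value. The
# difference-in-means statistic is a simplified stand-in for the actual
# SCEPTRE test statistic.
exact_perm_pvalue <- function(y, trt, B) {
  obs <- mean(y[trt == 1]) - mean(y[trt == 0])  # observed statistic
  null_stats <- replicate(B, {
    trt_perm <- sample(trt)                     # permute treatment labels
    mean(y[trt_perm == 1]) - mean(y[trt_perm == 0])
  })
  # the +1 correction guarantees validity under the null
  (1 + sum(abs(null_stats) >= abs(obs))) / (B + 1)
}

set.seed(4)
y <- rpois(60, lambda = 1)        # null data: no treatment effect
trt <- rep(c(1, 0), each = 30)
p <- exact_perm_pvalue(y, trt, B = 999)
```

Under this correction, the smallest attainable p-value is 1/(B + 1); with B = 300,000 resamples the floor is about 3.3e-6.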
Sample size has emerged as an important quantity in low MOI (and even in high MOI). Here, I study issues related to sample size in a more precise and in-depth way than I have previously.
I begin by defining several quantities –
n_nonzero_treatment, n_nonzero_control, and
effective_sample_size – on the negative control data. Let a
dataset be given. Let \(d\) be the
number of NTCs, let \(p\) be the number
of genes, and let \(n\) be the number
of cells in the dataset. Let \(X \in
\mathbb{R}^{n \times p}\) be the cell-by-gene expression matrix.
Next, for \(k \in \{1, …, d\},\) let
\(n_k\) be the number of cells
containing NTC \(k\) (where the NTCs
are arbitrarily indexed). Also, let \(X^{1}
\in \mathbb{R}^{n_1 \times p}\) be the submatrix of \(X\) consisting of the cells that received
NT 1 (and similarly for \(X^2, \dots,
X^d\)). For a given NTC \(k\)
and gene \(j\), let
$$\texttt{n_nonzero_treatment}_{jk}$$ be the number of cells containing
NTC \(k\) in which gene \(j\) has nonzero expression, i.e.
\[ \texttt{n_nonzero_treatment}_{jk} = \sum_{i=1}^{n_k} \mathbb{I}(X^k_{i,j} \geq 1).\]
Next, let \(\texttt{n_nonzero_control}_{jk}\) be the number of NTC-containing cells excluding those containing NTC \(k\) (i.e., the cells containing NTCs in the set \([d] \setminus \{k\}\)) in which gene \(j\) has nonzero expression, i.e.,
\[ \texttt{n_nonzero_control}_{jk} = \sum_{k' \neq k} \texttt{n_nonzero_treatment}_{k'j} \]
Finally, let effective_sample_size be the sum of
n_nonzero_treatment and n_nonzero_control. On
the negative control data, effective_sample_size is a
function of the gene only:
\[ \texttt{effective_sample_size}_j = \sum_{k = 1}^d \texttt{n_nonzero_treatment}_{jk} \]
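These definitions can be sketched directly in code. Below is a toy computation on simulated data (the matrix orientation and all parameter values are arbitrary assumptions), which also verifies the identity that n_nonzero_treatment plus n_nonzero_control equals effective_sample_size:

```r
# Toy data: n cells x p genes, each cell carrying one of d NTCs
set.seed(1)
n <- 60; p <- 4; d <- 3
X <- matrix(rpois(n * p, lambda = 0.5), nrow = n, ncol = p)
ntc <- sample(seq_len(d), n, replace = TRUE)  # NTC index of each cell

# n_nonzero_treatment[j, k]: cells carrying NTC k with gene j nonzero
n_nonzero_treatment <- sapply(seq_len(d), function(k) {
  colSums(X[ntc == k, , drop = FALSE] >= 1)
})  # p x d matrix

# effective_sample_size[j]: nonzero cells for gene j across all NTCs
effective_sample_size <- rowSums(n_nonzero_treatment)

# n_nonzero_control[j, k]: nonzero cells for gene j among the other NTCs
n_nonzero_control <- effective_sample_size - n_nonzero_treatment
```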
I study how effective_sample_size,
n_nonzero_treatment, and n_nonzero_control
relate to one another.
effective_sample_size, n_nonzero_treatment, and n_nonzero_control

First, I plot n_nonzero_treatment against
effective_sample_size, faceting by dataset. In other words,
I plot
\[ \{ (\texttt{n_nonzero_treatment}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k} \]
for each dataset.
Clearly, n_nonzero_treatment and
effective_sample_size are correlated. This makes sense: for
a given gene, if expression is high in cells with a given NTC, then
expression likely is high in cells with the remaining NTCs. (Put
differently, expression levels are correlated across NTCs: highly
expressed genes tend to be highly expressed across NTCs, and similarly
for lowly expressed genes.)
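This explanation can be reproduced on simulated data: when genes have gene-specific expression rates, per-NTC nonzero counts and the gene-level total are strongly correlated. A toy simulation (all parameter values below are arbitrary assumptions):

```r
# Toy demo: gene-specific nonzero probabilities induce correlation
# between per-NTC nonzero counts and the gene-level total
set.seed(2)
n_per_ntc <- 50; d <- 10; p <- 200
lambda <- rexp(p, rate = 2)          # gene-specific expression rates
trt_counts <- sapply(seq_len(p), function(j) {
  # nonzero cells per NTC for gene j, under a Poisson expression model
  rbinom(d, n_per_ntc, prob = 1 - exp(-lambda[j]))
})                                   # d x p matrix
ess <- colSums(trt_counts)           # effective sample size per gene

# correlation between each (gene, NTC) count and its gene-level total
r <- cor(as.vector(trt_counts), rep(ess, each = d))
```

The correlation is strong but imperfect, since per-NTC binomial noise sits on top of the shared gene-level rate.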
Next, I plot n_nonzero_control against
effective_sample_size, i.e.
$$\{(\texttt{n_nonzero_control}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k}$$
Clearly, n_nonzero_control and
effective_sample_size are basically the same variable.
Again, this makes sense: for a given gRNA,
n_nonzero_control is equal to
effective_sample_size minus the number of cells with
nonzero expression containing that gRNA. Finally, I plot
n_nonzero_treatment against
n_nonzero_control.
Again, these variables are correlated (but certainly not perfectly)
for the same reason that n_nonzero_treatment and
effective_sample_size are correlated.
I conclude on the basis of this analysis that, while
effective_sample_size is a useful single-number summary of
sample size, it is meaningful to look at both
n_nonzero_treatment and n_nonzero_control when
examining sample size-related issues.
Here, I make some comparisons related to sample size across datasets. First, I make a barplot of the number of NTCs per dataset.
The Frangieh data contain the greatest number of NTCs, followed by
the Schraivogel data and then the Papalexi data. Next, I plot the
average number of nonzero cells per NTC (i.e.,
n_nonzero_treatment) across datasets.
I print the median n_nonzero_treatment value for each
dataset below.
sample_size_df_nt |>
group_by(dataset) |>
summarize(m = median(n_nonzero_treatment))
## # A tibble: 7 × 2
## dataset m
## <fct> <dbl>
## 1 frangieh_co_culture_gene 9
## 2 frangieh_control_gene 6
## 3 frangieh_ifn_gamma_gene 7
## 4 papalexi_eccite_screen_gene 29
## 5 schraivogel_enhancer_screen 52.5
## 6 schraivogel_ground_truth_perturbseq_gene 37
## 7 schraivogel_ground_truth_tapseq_gene 165
The Frangieh data have the smallest number of nonzero cells per NTC
(median under 10). The Papalexi data are intermediate (median about 30),
and the Schraivogel data have the greatest number of nonzero cells per NTC
(median 37 on the perturb-seq data; even greater medians on the other Schraivogel datasets).
Finally, I plot the effective_sample_size per dataset.
I print the per-dataset median effective sample size below.
sample_size_df_nt |>
group_by(dataset) |>
summarize(m = median(effective_samp_size))
## # A tibble: 7 × 2
## dataset m
## <fct> <dbl>
## 1 frangieh_co_culture_gene 712
## 2 frangieh_control_gene 436
## 3 frangieh_ifn_gamma_gene 581
## 4 papalexi_eccite_screen_gene 315
## 5 schraivogel_enhancer_screen 1764
## 6 schraivogel_ground_truth_perturbseq_gene 1186
## 7 schraivogel_ground_truth_tapseq_gene 5775
We see that the effective sample sizes are in a similar range on the Frangieh and Papalexi datasets (while the effective sample sizes on the Schraivogel datasets are larger). The Frangieh data have fewer nonzero cells per NTC but many more NTCs than the Papalexi data, causing the effective sample size to be roughly equal across the Frangieh and Papalexi data.
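As a rough back-of-the-envelope check of this trade-off (rough, because a ratio of medians only approximates the number of NTCs), dividing the median effective sample size by the median per-NTC nonzero count gives an implied NTC count:

```r
# Implied NTC count ~ median effective sample size divided by median
# nonzero cells per NTC (medians taken from the tables printed above)
median_ess <- c(frangieh_co_culture = 712, papalexi = 315)
median_trt <- c(frangieh_co_culture = 9, papalexi = 29)
implied_n_ntc <- round(median_ess / median_trt)
```

The implied counts (roughly 79 for Frangieh co-culture versus roughly 11 for Papalexi) line up with the Frangieh data having many more NTCs than the Papalexi data.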
Now, I examine the number of nonzero cells per targeting gRNA (grouped and ungrouped) across datasets.
sample_size_df_t |>
group_by(dataset) |>
summarize(median_nonzero = median(n_nonzero_cells))
## # A tibble: 7 × 2
## dataset median_nonzero
## <fct> <dbl>
## 1 frangieh_co_culture_gene 5
## 2 frangieh_control_gene 3
## 3 frangieh_ifn_gamma_gene 4
## 4 papalexi_eccite_screen_gene 17
## 5 schraivogel_enhancer_screen 14
## 6 schraivogel_ground_truth_perturbseq_gene 26
## 7 schraivogel_ground_truth_tapseq_gene 117
Next, I plot the number of nonzero cells per gRNA group:
First, I plot the number of genes per dataset (after applying mild QC: filtering out genes expressed in fewer than 0.5% of cells).
All datasets have roughly the same number of genes (about 15,000) except for the TAP-seq and enhancer screen datasets (which of course have fewer genes). Next, I plot the number of pairs across datasets.
The Frangieh data of course contain more pairs because they contain more NT gRNAs.
I applied the exact version of SCEPTRE (with covariates and using B = 300,000 permutations) to the Papalexi data and Frangieh data.
First, I plot the SCEPTRE exact results on the Frangieh data, comparing to (i) the skew-normal version of SCEPTRE (i.e., SCEPTRE (SN); with covariates), (ii) NB regression, and (iii) Seurat DE. I plot the results twice: once without QC (top) and once with mild QC (bottom). (“Mild QC” entails filtering for pairs with at least 7 treatment cells with nonzero expression and at least 30 control cells with nonzero expression. See the next writeup (Writeup 8) for a justification of these thresholds.)
SCEPTRE (exact) exhibits good calibration regardless of whether we apply QC. SCEPTRE (SN) and Seurat DE exhibit similar performance on the QC’ed and non-QC’ed data. Furthermore, these methods both perform better on the QC’ed data than on the non-QC’ed data, likely because the CLT-based approximation that these methods employ is better on the QC’ed data. Finally, NB regression does not seem to improve much (or at all) as we apply more stringent QC. This might be due to model misspecification for some fraction of hypotheses tested. The interaction between QC level and method performance is fairly striking (and perhaps expected, given the extent to which these methods rely on the CLT).
Next, I create the same set of plots for the Papalexi data. I plot SCEPTRE (exact), SCEPTRE (SN), Seurat DE, and NB regression. The top plot shows the results on the un-QC’ed data, and the bottom plot shows the results on the QC’ed data (where, again, QC entails filtering for pairs with at least 7 treatment cells with nonzero expression and at least 30 control cells with nonzero expression).
I compute the fraction of negative control pairs removed after applying QC. Note that the fraction of pairs filtered on the negative control data should be much greater than that on the rest of the data, as the negative control data have one fewer NTC in the control group and only a single gRNA in the undercover group. (In other words, the sample sizes are generally smaller on the negative control data than on the rest of the data.)
res |>
filter(dataset %in% c("frangieh_co_culture_gene",
"papalexi_eccite_screen_gene")) |>
group_by(dataset) |>
summarize(frac_removed = 1 - mean(n_nonzero_treatment >= 7 & n_nonzero_control >= 30))
## # A tibble: 2 × 2
## dataset frac_removed
## <chr> <dbl>
## 1 frangieh_co_culture_gene 0.434
## 2 papalexi_eccite_screen_gene 0.240
Interestingly, SCEPTRE (exact) is miscalibrated on the un-QC’ed data but is excellently calibrated on the QC’ed data. I conjecture that SCEPTRE (exact) is miscalibrated on the un-QC’ed data for the following reason: some of the pairs are confounded, and the sample size is too small for CAMP to have kicked in. On the QC’ed data, meanwhile, CAMP has kicked in, enabling SCEPTRE to return a well-calibrated p-value for the confounded pairs. In fact, on the QC’ed data, SCEPTRE (exact) exhibits the best calibration of all the methods and is even better than NB regression. This possibly is because NB regression is less robust than SCEPTRE to GLM model misspecification.
Two questions arise: first, why is SCEPTRE (SN) better calibrated than SCEPTRE (exact) on the non-QC’ed data? And second, why is NB regression better calibrated than SCEPTRE on the non-QC’ed data? I attempt to address the first question here. Below, I plot the SCEPTRE (exact) p-values against the SCEPTRE (SN) p-values on a transformed scale.
The SCEPTRE (SN) p-values are in broad agreement with the SCEPTRE (exact) p-values; however, the scatterplot has an obvious “appendage” below the diagonal. The pairs in this “appendage” have much smaller SCEPTRE (exact) p-values than SCEPTRE (SN) p-values. Nearly all the pairs in this appendage are filtered out by QC. Below, I create a similar plot to compare the SCEPTRE p-values to the negative binomial p-values.
These results are even more striking than the results comparing SCEPTRE (exact) to SCEPTRE (SN). There exists a set of pairs for which the NB regression p-value is about 1 and the SCEPTRE (exact) p-value is in the interval (0, 1). (These are the points at the bottom of the plot.) We see that very few of these pairs pass QC. What is the deal with these pairs? I filter for pairs with NB p-values greater than 0.99 and SCEPTRE (exact) p-values less than 0.5. (This is the row of points at the bottom of the plot to the right of the vertical black line.)
comparison_df |>
filter(p_value_nb > 0.99,
p_value_exact < 0.5) |>
summarize(m = mean(n_nonzero_treatment == 0)) |>
pull() * 100
## [1] 99.62238
Nearly all of these points (99.6%) have zero treatment cells with expression. We thus have an (at least proximate) explanation for the discrepancy between SCEPTRE (exact) and NB regression. When no treatment cell contains nonzero gene expression, SCEPTRE (exact) and NB regression both fail. SCEPTRE (exact) fails in a liberal direction, producing inflated p-values. NB regression, by contrast, fails in a conservative direction, yielding p-values equal to one. We likely will filter out such pairs as part of QC. Thus, on the pairs that we will subject to analysis, NB regression and SCEPTRE (exact) coincide fairly closely. (And in fact, SCEPTRE (exact) exhibits slightly better calibration on the QC’ed pairs, likely due to greater robustness to model misspecification.)
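This failure mode can be reproduced on toy data. The sketch below uses stand-ins, not the actual methods: a Poisson GLM substitutes for the NB fit (it degenerates the same way when one group is all zeros), and a difference-in-means permutation test substitutes for the SCEPTRE (exact) statistic. With expression zero in every treatment cell, the GLM coefficient diverges toward negative infinity, its standard error blows up, and the Wald p-value lands near 1, while the permutation p-value remains strictly inside (0, 1):

```r
set.seed(7)
y_control <- rpois(50, lambda = 2)   # control cells: mostly nonzero
y_treat   <- rep(0, 50)              # treatment cells: all zeros
y <- c(y_treat, y_control)
trt <- rep(c(1, 0), each = 50)

# GLM side: complete separation toward zero -> huge SE -> Wald p near 1
fit <- suppressWarnings(glm(y ~ trt, family = poisson))
p_glm <- summary(fit)$coefficients["trt", "Pr(>|z|)"]

# Permutation side: the p-value is still a number strictly in (0, 1)
obs <- mean(y[trt == 1]) - mean(y[trt == 0])
null_stats <- replicate(999, {
  trt_perm <- sample(trt)
  mean(y[trt_perm == 1]) - mean(y[trt_perm == 0])
})
p_perm <- (1 + sum(abs(null_stats) >= abs(obs))) / (999 + 1)
```

This toy example illustrates the mechanism of the discrepancy (a degenerate Wald test versus a still-computable resampling p-value), not the calibration behavior of either method.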
The results in this writeup are encouraging. After applying very mild QC, SCEPTRE emerges as the best method on both the Papalexi and Frangieh data. This is due to CAMP.
# QQ plot of negative control p-values on the enhancer screen data,
# after applying the QC thresholds used above
n_nonzero_treatment_thresh <- 7
x <- res |>
  filter(dataset == "schraivogel_enhancer_screen")
x |>
  filter(n_nonzero_treatment >= n_nonzero_treatment_thresh,
         n_nonzero_control >= 30) |>
  ggplot(aes(y = p_value, col = Method)) +
  stat_qq_points(ymin = 1e-9, size = 0.8) +
  geom_abline(col = "darkred") +
  stat_qq_band() +
  theme_bw() +
  scale_x_continuous(trans = revlog_trans(10)) +
  scale_y_continuous(trans = revlog_trans(10)) +
  labs(x = "Expected quantile", y = "Observed quantile") +
  theme(legend.position = "bottom",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  ggtitle("Enhancer screen, after QC")